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Abstract 

We consider the problem of subspace estimation in a Bayesian setting. Since we are operating 
in the Grassmann manifold, the usual approach which consists of minimizing the mean square error 
(MSE) between the true subspace U and its estimate U may not be adequate as the MSE is not 
the natural metric in the Grassmann manifold. As an alternative, we propose to carry out subspace 
estimation by minimizing the mean square distance (MSD) between U and its estimate, where the 
considered distance is a natural metric in the Grassmann manifold, viz. the distance between the 
projection matrices. We show that the resulting estimator is no longer the posterior mean of U 
but entails computing the principal eigenvectors of the posterior mean of UU^ . Derivation of the 
MMSD estimator is carried out in a few illustrative examples including a linear Gaussian model 
for the data and a Bingham or von Mises Fisher prior distribution for U . In all scenarios, posterior 
distributions are derived and the MMSD estimator is obtained either analytically or implemented via 
a Markov chain Monte Carlo simulation method. The method is shown to provide accurate estimates 
even when the number of samples is lower than the dimension of U . An application to hyperspectral 
imagery is finally investigated. 
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I. Problem statement 

In many signal processing applications, the signals of interest do not span the entire observation 
space and a relevant and frequently used assumption is that they evolve in a low-dimensional subspace 
ifTl . Subspace modeling is accurate when the signals consist of a linear combination of p modes in a 
A^-dimensional space, and constitute a good approximation for example when the signal covariance 
matrix is close to rank-deficient. As a consequence, subspace estimation plays a central role in 
recovering these signals with maximum accuracy. An ubiquitous solution to this problem is to resort 
to the singular value decomposition (SVD) of the data matrix. The SVD emerges naturally as the 
maximum likelihood estimator in the classical model Y = US + N, where Y stands for the N x K 
observation matrix, U is the (deterministic) N x p matrix, with p < N, whose columns span the 
p-dimensional subspace of interest, S is the p x K (deterministic) waveform matrix and AT" is the 
additive noise. The p principal left singular vectors of Y provide very accurate estimates of a basis 
for the range space TZ (U) of U, and have been used successfully, e.g., in estimating the frequencies 
of damped exponentials or the directions of arrival of multiple plane waves, see ||2l, 131 among others. 
However, the SVD can incur some performance loss in two main cases, namely when the signal-to- 
noise ratio (SNR) is very low and thereof the probability of a subspace swap or subspace leakage 
is high lIll-llTl. A second case occurs when the number of samples K is lower than the subspace 
dimension p: indeed, Y is at most of rank K and information is lacking about how to complement 
TZ (Y) in order to estimate TZ{U). 

Under such circumstances, a Bayesian approach might be helpful as it enables one to assist 
estimation by providing some statistical information about U. We investigate such an approach herein 
and assign to the unknown matrix U an appropriate prior distribution, taking into account the specific 
structure of U. The paper is organized as follows. In section |II1 we propose an approach based on 
minimizing a natural distance on the Grassmann manifold, which yields a new estimator of U. The 
theory is illustrated in section |lll] where the new estimator is derived for some specific examples. In 
section UV] its performance is assessed through numerical simulations, and compared with conventional 
approaches. Section |V] studies an application to the analysis of interactions between pure materials 
contained in hyperspectral images. 

II. Minimum mean square distance estimation 

In this section, we introduce an alternative to the conventional minimum mean square error (MMSE) 
estimator, in the case where a subspace is to be estimated. Let us consider that we wish to estimate 
the range space of U from the joint distribution p (Y, U) where Y stands for the available data 
matrix. Usually, one is not interested in U per se but rather in its range space TZ (U), and thus we 
are operating in the Grassmann manifold Gn,p, i-fi-, the set of p-dimensional subspaces in El. 
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It is thus natural to wonder whether the MMSE estimator (which is the chief systematic approach 
in Bayesian estimation ||9|) is suitable in Gn p- The MMSE estimator of a vector 6 minimizes 

. ' r . 2^ 

the average Euclidean distance between 9 and 0, i.e., E 



e-G 



Despite the fact that this 



distance is natural in an Euclidean space, it may not be the more natural metric in Gm,p- In fact, the 

1 /2 1 - -J , 

natural distance between two subspaces IZ (Ui) and TZ {U2) is given by {Ylk=i ^k) lEl where 6k 
are the principal angles between these subspaces, which can be obtained by SVD of U'^Ui where 
Ui and U2 denote orthonormal bases for these subspaces lITOll . The SVD of U2U1 is defined as 
U2U1 = Xdiag (cos ^1, ■ ■ ■ , cos 9p) Z'^, where X and Z are two pxp unitary matrices. Therefore, 



it seems more adequate, rather than minimizing 



as the MMSE estimator does, to minimize 



U-U 

the natural distance between the subspaces spanned by U and U. Although this is the most intuitively 
appealing method, it faces the drawback that the cosines of the angles and not the angles themselves 
emerge naturally from the SVD. Therefore, we consider minimizing the sum of the squared sine of 
the angles between [/ and U, since for small 9^., sin^^ ~ 6^. As argued in jSl, lITOl . this cost function 
is natural in the Grassmann manifold since it corresponds to the Frobenius norm of the difference 
between the projection matrices on the two subspaces, viz X]!fe=i ^i^^ = UU — UU'^ = 
(P (tj, . It should be mentioned that our approach follows along the same principles as in ifTTl 
where a Bayesian framework is proposed for subspace estimation, and where the author considers 
minimizing d (u, . Hence the theory presented in this section is similar to that of ifTTl . with some 
exceptions. Indeed the parameterization of the problem in IfTTl differs from ours and the application 
of the theory is also very different, see the next section. 

Given that (u, = 2 (p - Tr ^U^UU^U^^ , we define the minimum mean-square distance 
(MMSD) estimator of U as 



Since 



it follows that 

^mmsd 



mmsd 



ar 



gmaxE|TY|[/^J7(7^L/}} . 



}} = J J Tr^U^UU^U^p{U\Y)dU 



p{Y)dY 
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u 



U UU'U\p{U\Y)dU 

T 



UU'p{U\Y)dU 



u 



(3) 



Therefore, the MMSD estimate of the subspace spanned by U is given by the p largest eigenvectors 
of the matrix / UU'^p {U\Y) dU, which we denote as 

n<! / UU^piU\Y)du] . (4) 



u 



mmsd 
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In other words, MMSD estimation amounts to find the best rank-p approximation to the posterior 
mean of the projection matrix UU^ on TZ {U). For notational convenience, let us denote M (Y) = 
f UU'^p {U\Y)dU. Except for a few cases where this matrix can be derived in closed-form (an 
example is given in the next section), there usually does not exist any analytical expression for 
M (Y). In such situation, an efficient way to approximate the matrix M (Y) is to use a Markov 
chain Monte Carlo simulation method whose goal is to generate random matrices U drawn from the 
posterior distribution p{U\Y), and to approximate the integral in (|4l) by a finite sum. This aspect 
will be further elaborated in the next section. Let M (Y) = U m{Y)Lm{Y)UIj{Y) denote the 
eigenvalue decomposition of M (Y) with Lm{Y) = diag {£i{Y),£2{Y), ■ ■ ■ ,iN{Y)) and £i{Y) > 
h{Y) > ■ ■ ■ > £n{Y). Then the average distance between f/mmsd and U is given by 

E {d^ {Umn..6, u)]=2p-2 j y TV {j7LsdC^C^''t^mmsd} P {U\Y) dU p (Y) dY 

= 2p - 2 y Tr {L/Ld^ (Y) U^^,,] p (Y) dY 

= 2p-2^ £k{Y)p{Y)dY. (5) 
k=i-' 

The latter expression constitutes a lower bound on E (^U, | and is referred to as the Hilbert- 
Schmidt bound in lITTI . |[T2l . As indicated in these references, and similarly to M (Y), this lower 
bound may be difficult to obtain analytically. 

The MMSD approach can be extended to the mixed case where, in addition to U, a parameter 
vector 6 which can take arbitrary values in needs to be estimated jointly with U. Under such 
circumstances, one can estimate U and as 



Un.m.d, ^mmsd) = arg min E | -Tr { U^UU^ll] + {o - of {o - 6 



(6) 



Doing so, the MMSD estimator of U is still be given by dH) while the MMSD and MMSE estimators 
of 6 coincide. 

Remark 1 The MMSD approach differs from an MMSE approach which would entail calculating the 
posterior mean of U , viz J Up {U\Y) dU. Note that the latter may not be meaningful, in particular 
when the posterior distribution p{U\Y) depends on U only through UU'^, see next section for an 
example. In such a case, post-multiplication of U by any p x p unitary matrix Q yields the same 
value of p {U\Y). Therefore averaging U over p{U\Y) does not make sense while computing ^ is 
relevant. On the other hand, if p {U\Y) depends on U directly, then computing the posterior mean of 
U can be investigated: an example where this situation occurs will be presented in the next section. 
As a final comment, observe that J Up {U\Y) dU is not necessarily unitary but its range space can 
be used to estimate TZ (U). 
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Remark 2 We open a parenthesis here regarding the framework of this paper. Although it is not 
directly related to this paper (we do not address optimization problems here) it is interesting to note 
the recent growing interest in optimization problems on special manifolds, especially on the Stiefel 
manifold (the set of x p matrices U such that U'^U = I) and the Grassmann manifold, see the 
excellent tutorial paper by Edelman et al. HI as well as |[T3l . |[T4l . and ifTSl - lfTTl for signal processing 
applications. These references show the interest of taking into account the underlying geometry of 
the problem, as we attempt to do herein. 

III. Illustration examples 

In this section we illustrate the previous theory on some examples, including the conventional linear 
Gaussian model (conditioned on U) and a model involving the eigenvalue decomposition of the data 
covariance matrix. As a first step, we address the issue of selecting prior distributions for U and then 
move on to the derivation of the MMSD estimator. 



A. Prior distributions 

A crucial step in any Bayesian estimation scheme consists of selecting the prior distribution for the 
variables to be estimated. We focus here on distributions on the Stiefel or Grassmann manifold, 
depending whether we consider the matrix U itself or its range space. There exist only a few 
distributions on the Stiefel or Grassmann manifolds, the most widely accepted being the Bingham or 
von Mises Fisher (vMF) distributions ifTSl . |[T9l . which are given respectively by 

where etr {.} stands for the exponential of the trace of the matrix between braces, A is an A^ x 
symmetric matrix, is an A^ x p arbitrary matrix, and o^i (a; X), iFi (a, 6; X) are hypergeometric 
functions of matrix arguments, see e.g. |[T9l for their definitions. We will denote these distributions as 
B {A) and vMF {F), respectively. Observe that the Bingham distribution depends on UU'^ only, and 
can thus be viewed as a distribution on the Grassmann manifold ifTSl . |[T9l while the vMF distribution 
depends on U and is a distribution on the Stiefel manifold. In our case, in order to introduce some 
knowledge about U, we assume that it is "close" to a given subspace spanned by the columns of an 
orthonormal matrix tj, and hence we consider two possible prior distributions for U , namely 

TTB ([/) oc etr I kU^UU^U^ (9) 

TTvMF {U) oc etr { kU^U} (10) 
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where oc means "proportional to". The distribution in ^ is proportional to the sum of the squared 
cosine angles between TZ (U) and 71 (U) while tTvMF (U) is proportional to the sum of the cosine 
angles between TZ (U) and TZ [U). Note that /t is a concentration parameter: the larger k the more 
concentrated around U axe the subspaces U. The difference between the two distributions is the 
following. In the Bingham distribution only 71{U) and TZ (U) axe close (at least for large values of 
k) since ttb (U) is invariant to post-multiplication of U by any p x p unitary matrix Q . Hence U 
is not necessarily close to U. In contrast, under the vMF prior distribution, U and U axe close. For 
illustration purposes. Figure [T] displays the average fraction of energy of J7 in 7?- (U) defined as 

AFE{U,U) =E^Tr^U^UU^U^ /pY (11) 

As can be observed from these figures, both distributions allow the distance between U and U to be 
set in a rather flexible way. Their AFE is shown to be identical for small values of the concentration 
parameter but, when k increases, the AFE of the vMF distribution increases faster. Additionally, even 

Average fraction of energy of U in R(Ubar) 



Bingham 

- ' - von Mises Fisher 




10 20 30 40 50 60 70 80 90 100 

Concentration parameter k 



Fig. 1. Average fraction of energy of [/ in 7?, (U) versus n. N — 20, p = 5. 

if the AFE are close for small values of k, the distributions of the angles between TZ {U) and TZ (U) 
exhibit some differences, as shown in Figures |2] and |3] which display the probability density functions 
of these angles for k = 20. 

B. Linear model 

In order to illustrate how the previous theory can be used in practice, we first consider a simple 
example, namely a linear Gaussian model (conditioned on U), i.e., we assume that the data follows 
the model Y = US + N where the columns of N are independent and identically distributed (i.i.d.) 
Gaussian vectors with zero-mean and (known) covariance matrix o"^/. We assume that no knowledge 
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Distribution of the angies between U and Ubar - Bingiiam prior 




Fig. 2. Distribution of the angles between TZ (U) and TZ (U) for a Bingham distribution. N = 20, p = 5 and k — 20. 



Distribution of tiie angles between U and Ubar - vMF prior 




Fig. 3. Distribution of the angles between TZ (U) and TZ (U) for a von Mises Fisher distribution. A'^ — 20, p = 5 and 
K = 20. 



about S is available and hence its prior distribution is set to vr (S) oc 1. Therefore, conditioned on 
U we have 



p{Y\U) = J p{Y\U,S)7r{S)dS 
oc j etvf^-^iY -USf {Y-US)^dS 



(12) 
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When U follows the Bingham prior distribution, the posterior distribution of U , conditioned on Y 
is given by 



p{U\Y) a etr <^ C/ 



U\ (13) 



which is recognized as a Bingham distribution with parameter matrix kUU + i^YY , i.e., 
U\Y B (^kUIJ^ + 2^YY^^ . For such a Bingham distribution, it turns out that the eigenvectors 
UU' p {U\Y) dU coincide with those of kUU + ^YY^ , with the same ordering of their 
eigenvalues, see Appendix lAl for a proof. Therefore the MMSD estimator is obtained in closed-form 

as 

t/mmsd-LM-B = Vp ^^kUU^ + ^^^^} ' (^4) 

Therefore, the MMSD estimator has a very simple form in this case. It consists of the principal 

- - T 

subspace of a (weighted) combination of the a priori projection matrix UU and the information 
brought by the data through YY^ . Observe that, in this particular case of a Bingham posterior, the 
MMSD estimator coincides with the maximum a posteriori (MAP) estimator. 

Let us now consider the case where the prior distribution of U is vMF, and contrast it with the 
previous example. Using (fT2l) along with along with (fTOl ). it follows that the posterior distribution 
now writes 

p{U\Y)ocetT!^KU^Lf + ^U^YY^U^ (15) 

which is referred to as the Bingham-von-Mises-Fisher (BMF) distribution with parameter matrices 
YY^, and kU respectiveljQ. Although this distribution is known ifTOll . to our knowledge, there 
does not exist any analytic expression for the integral in (01) when U\Y has the BMF distribution 
([T5]| . Therefore, the MMSD estimator cannot be computed in closed-form. In order to remedy this 
problem, a Markov chain Monte Carlo simulation method can be advocated ll20l . ||2TI to generate a 
large number of matrices [/^"^ drawn from (fTSl ). and to approximate ^ as 

Uramsa-LM-.MF ^Pplj^ [/W(7(")" I . (16) 

In ([T6l ). A'bi is the number of bum-in samples and N.,. is the number of samples used to approximate 
the estimator. An efficient Gibbs sampling scheme to generate random unitary matrices drawn from 
a BMF(A,B,C) distribution with arbitrary full-rank matrix A was proposed in ll22l . It amounts 
to sampling successively each column of U by generating a random unit norm vector drawn from 
a (vector) BMF distribution. In our case, A = YY'^ whose rank is mm{K, N) and hence A is 
rank-deficient whenever K < N, a. case of most interest to us. Note also that to generate matrices U 



'The matrix X is said to have a BMF {A, B, C) distribution -where A is an TV x A'^ symmetric matrix, S is a p x p 
diagonal matrix and C is an x p matrix- if p{X) oc etr {C^X -I- BX^AX}. 



_ — ^ 

drawn from the Bingham distribution in Q, we need to consider A = UU which has rank p < N. 



Therefore, the scheme of II221I needs to be adapted in order to generate random matrices drawn from 
([TSl l. In Appendix |Bj we review the method of |[22]| and show how it can be modified to handle the 
case of a rank-deficient matrix A. 

Remark 3 Interestingly enough, the above estimator in (fT6l ) is the so-called induced arithmetic mean 
(lAM) ||23l of the set of unitary matrices [/("^ n = iVbi + 1, • • • , iVbi + ^^r- It differs from the Karcher 
mean of the set U^^\ n = Nb\ + !,••• , A'bi + Nr, which truly minimizes the sum of the distances to 
all U^"'\ However, the Karcher mean may not exist and requires iterative schemes to be computed 
ll24l while the lAM is straightforward to compute. 



Remark 4 In the particular case where U has a Bingham prior distribution, the MAP estimator of 
U and its MMSD estimator are equal. This is no longer true when U has a vMF prior distribution, 
and hence a BMF posterior distribution. The mode of the latter is not known in closed-form either. 
However, it can be approximated by selecting, among the matrices generated by the Gibbs sampler, 
the matrix which results in the largest value of the posterior distribution. 

C. Covariance matrix model 

We now consider a more comphcated case where Y , conditioned on U and A, is Gaussian 
distributed with zero-mean and covariance matrix 

R=^{YY^] = UAU^ + all (17) 

where U is an orthonormal basis for the signal subspace, A is the diagonal matrix of the eigenvalues 
and fj^ stands for the white noise power which is assumed to be known here. As it will be more 
convenient and more intuitively appealing, we re-parametrize the covariance matrix as follows. The 
inverse of R can be written as 



= U 



a-^I -a"^UA(A + all) ^ 



- vl -vU {I -T)U^ (18) 



where v = , T = diag (7) with 7 



71 72 ••• 7p 



T 



and 



The idea is to parametrize the problem in terms of U and T rather than U and A. The interest of this 
transformation is twofold. First, it enables one to express all eigenvalues with respect to the white noise 
level. Indeed, one has R = iy-^U±Ul+u~^UT~'^U'^ where U± is an orthonormal basis for TZ (U) 



a: 



2 
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and hence the 7^8 are representative of the scaUng between the "signal" eigenvalues and the noise 
eigenvalues. In fact, they carry information about the signal-to-noise ratio since 7/; = (l + ^) ^ and 
^ represents the SNR of the k-th signal component. Second, this new parametrization will facilitate 
derivation of the conditional distributions required for the Gibbs sampler. 
Since Y conditioned on U and 7 is Gaussian, it follows that 



p{Y\U,j) = i27r) 



-NK/2 



R\ 



-K/2 



etr 



-Y^R-'Y 
2 



From R-^ = uU iU\ + uUTU'^ , it ensues that = \T\ and hence 

p(l"|?7,7) oc irl^/^etr [vl - uU {I -T)U^] l^j 



(20) 



(21) 



Let us now consider the prior distributions for U and 7 . We will consider either a Bingham or vMF 
distribution for U. As for 7, we assume that 7^ are a priori independent random variables uniformly 
distributed in the interval [7_,7+], i.e., 



(22) 



k=l 



The value of 7+ [respectively 7_] can be set to 1 [respectively 0] if a non-informative prior is desired. 
Otherwise, if some information is available about the SNR, 7_ and 7-|_ can be chosen so as to reflect 
this knowledge since 7^ = (1 + SNRk)^^: 7+ [resp. 7_] rules the lowest [resp. highest] value of 
the SNR, say SNR^ [resp. SNR+]. 

With the Bingham assumption for vr ([/), the joint posterior distribution of U and 7 is 

p {U, f\Y) a p {Y\U, 7) vr (U) tt (7) 



oc ir 



K/2 



n^[7-,7+](Tfc) 



\k=l 



X etr <! K.U'^UU" U + -Y' U (I -r)U'Y 



■}■ 



(23) 



In order to come up with the posterior distribution of U only, we need to marginalize (1231 ) with 



respect to Let Z = Y^ U 

p{U\Y) = J p{Un\Y)dj 



Zl Z2 



. Then, from (1231) one has 



oc etr ^kU^UU'^U + ^U^YY^U^ 
xH/ Tf''^ exp I - ^7fc II Zfcf I d7fc 

k=l '1'" 



OC etr <i kU^UU^ U + -U' YY' U 



X n ii^'^i 

k=l 



-2(l+X/2) 



V H „2 K 



/ ^ II l|2 1^^ 



(24) 
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where 7(2;, a) = j^t°-~^e~^dt is the incomplete Gamma function. Unfortunately, the above distribu- 
tion does not belong to any known family and it is thus problematic to generate samples drawn from 
it. Instead, in order to sample according to (|23] ). we propose to use a Gibbs sampler drawing samples 
according to p{U\Y,'^) and p{'^f^\Y,U) for A: = 1, • • • ,p. From (1231 ). the conditional distribution 
of U is 

p {U\Y, 7) oc etr ^kU^UU^U + ^ (/ - T) U^YY^U^ (25) 
which is recognized as a (modified) Bingham distributiorj^ 

U\Y,-fr^B(uU^,Kl,YYT,'^{I-T)y (26) 



Let us now turn to the conditional distribution of 'y\Y,U. From (1231 ) one has 



ph\Y,U) a |r|^^/%tr{-^ZrZ^} ni[7-,7.](7fc) 

\A:=1 y 

P 



p 

oc 

k=l 



(27) 



which is the product of independent gamma distributions with parameters ^+1 and | H^^fc |p, truncated 
in the interval [7., 7+]. We denote this distribution as 7^ ~ Gt (^y > 7-i 7+^ Random 

variables with such a distribution can be efficiently generated using the accept-reject scheme of |[25l . 

The above conditional distributions can now be used in a Gibbs sampler, as described in Table Jl 
When U has a vMF prior distribution, it is straightforward to show that U , conditioned on Y and 7, 
follows a BMF distribution U\Y BMF {YY'^ , ^{I -T), kU) while the posterior distribution 
of 'y\Y ,U is still given by (|27] ). Therefore line [2] of the Gibbs sampler in Table |I] just needs to be 
modified in order to handle this case. 



Input: initial values U^ '\ 7'°' 
1; for n = 1, ■ • • ,Nbi + Nr do 

2: sample LT*"' from 13 (^nl, UU'^ , f (-f - T*"-^' j , VY^j in l|25ll. 

3: for fc = 1, • ■ • ,p, sample 7^"' from Gt (^f + 1, | || V^^"' ||' , 7-, 7+^ in (|27}. 
4: end for 

Output: sequence of random variables C/*^"' and 7'"' 

TABLE I 
Gibbs sampler 



~ B (Ai, Bi, A2, B2) ^ p{X) oc etr {BiX^AiX + BiX'^ A2X} 
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IV. Simulations 



In this section we illustrate the performance of the approach developed above through Monte 
Carlo simulations. In all simulations = 20, p = 5 and k = 20. The matrix S is generated from 
a Gaussian distribution with zero-mean and co variance matrix a^I and the signal-to-noise ratio is 



iterations in the Gibbs sampler is set to iVbi = 10 and Nr = 1000. The MMSD estimator (01) is 
compared with the MAP estimator, the MMSE estimator, the usual SVD-based estimator and the 
estimator U = U that discards the available data and use only the a priori knowledge. The latter is 
referred to as "Ubar" in the figures. The estimators are evaluated in terms of the fraction of energy 



A. Linear model 

We begin with the linear model. Figures |4] to |7] investigate the influence of K and SNR onto the 
performance of the estimators. Figures |4] and [5] concern the Bingham prior while the vMF prior has 
been used to obtain Figures |6] and |7] From inspection of these figures, the following conclusions can 



• the MMSD estimator performs better than the estimator U = U, even at low SNR. The 
improvement is all the more pronounced that K is large. Therefore, the MMSD estimator makes 
a sound use of the data to improve accuracy compared to using the prior knowledge only. 

• the MMSD estimator performs better than the SVD, especially at low SNR. Moreover, and this 
is a distinctive feature of this Bayesian approach, it enables one to estimate the subspace even 
when the number of snapshots K is less than the size of the subspace p. 

• for a Bingham prior, the MMSE performs very poorly since the posterior distribution of U 
conditioned on Y depends on UU'^ only. Hence, averaging the matrix U itself does not make 
sense, see our remark [T] In contrast, when U has a vMF prior, the posterior depends on both 
U and UU'^: in this case, the MMSE performs well and is close to the MMSD. Note however 
that the vMF prior is more restrictive than the Bingham prior. 

• the MMSD estimator also outperforms the MAP estimator. 

As a conclusion, the MMSD estimator performs better than most other estimators in the large majority 



defined as SNR = 10 log (o-g/cr^). The matrix U is generated from the Bingham distribution Q 
or the vMF distribution (ITOl ) and, for the sake of simplicity, 11=1^ . The number of burn-in 




be drawn: 



of cases. 



B. Covariance matrix model 

We now conduct simulations with the covariance matrix model. The simulation parameters are 
essentially the same as in the previous section, except for the SNR. More precisely, the random 
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variables 7^. are drawn from the uniform distribution in (l22l ) where 7_ and 7+ are selected such that 
SNR^ = 5dB and SNR^ = lOdB. The results are shown in Fig. [8] for the Bingham prior and Fig. 
|9] for the vMF prior. They corroborate the previous observations made on the linear model, viz that 
the MMSD estimator offers the best performance over all methods. 

V. Application to hyperspectral imagery 

In this section, we show how the proposed subspace estimation procedure can be efficiently used for 
an application to multi-band image analysis. For several decades, hyperspectral imagery has received 
considerable attention because of its great interest for various purposes: agriculture monitoring, 
mineral mapping, military concerns, etc. One of the crucial issue when analyzing such image is 
the spectral unmixing which aims to decompose an observed pixel ifi into a collection of R = p+l 
reference signatures, mi, . . . ,m/j (called endmembers) and to retrieve the respective proportions of 
these signatures (or abundances) ai £, . . . , a^ i in this pixel |[26l . To describe the physical process that 
links the endmembers and their abundances to the measurements, the most widely admitted mixing 
model is linear 

R 

yi = '^ a-r/rrir (28) 

r=l 

where G is the pixel spectrum measured in N spectral bands, m,. G (r = 1, . . . 
are the R endmember spectra and ar/ (r = 1, . . . , i?) are their corresponding abundances. Due to 
obvious physical considerations, the abundances obey two kinds of constraints. Since they represent 
proportions, they must satisfy the following positivity and additivity constraints 

aj.£ > 0, r = 1, . . . , i2, 
< ' (29) 

Let now consider L pixels j/^, . . . , of an hyperspectral image induced by the linear mixing model 
(LMM) in (1281 ) with the abundance constraints ( [291 ). It is clear that the dataset formed by these L pixels 
lies in a lower-dimensional subspace U dW . More precisely, in this subspace U, the dataset belongs 
to a simplex whose vertices are the endmembers mi, . . . , m^ to be recovered. Most of the unmixing 
strategies developed in the hyperspectral imagery literature are based on this underlying geometrical 
formulation of the LMM. Indeed, the estimation of the endmembers is generally conducted in the 
lower-dimensional space lA, previously identified by a standard dimension reduction technique such 
as the principal component analysis (PCA) 1261 . However, it is well known that the model linearity is 
a simplifying assumption and does not hold anymore in several contexts, circumventing the standard 
unmixing algorithms. Specifically, non-linearities are known to occur for scenes including mixtures 
of minerals or vegetation. As a consequence, evaluating the suitability of the LMM assumption for a 
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given hyperspectral image is a capital question that can be conveniently addressed by the approach 
introduced above. 

A. Synthetic data 

First, we investigate the estimation of the subspace U when the image pixels are non-linear functions 
of the abundances. For this purpose, a 50 x 50 synthetic hyperspectral image is generated following a 
recently introduced non-linear model referred to as generalized bilinear model (GBM). As indicated in 
ETl . the GBM is notably well adapted to describe non-linearities due to multipath effects. It assumes 
that the observed pixel spectrum can be written 

R R-1 R 

ye = '^ a-r/rrir + X] X] 7jj,^«i,^«i,^"^i © '^i (30) 

r=l i=l j=i+l 

where stands for the Hadamard (termwise) product and the abundances a^,^ (r = 1, . . . , i?) satisfy 
the constraints in ( [291 ). In (l30l ). the parameters 'fiji (which belong to [0, 1]) characterize the importance 
of non-linear interactions between the endmembers mi and nij in the £-th pixel. In particular, when 
= (Vi,j), the GBM reduces to the standard LMM (|28l). Moreover, when 7^^^^^ = 1 (Vi,j), the 
GBM leads to the non-linear model introduced by Fan et al. in ESl . In this simulation, the synthetic 
image has been generated using the GBM with = 3 endmember signatures extracted from a spectral 
library. The corresponding abundances have been uniformly drawn in the set defined by the constraints 
( |29l ). We have assumed that there is no interaction between endmembers mi and 7713, and between 
endmembers m2 and m3 resulting in 7i,3,£ = 72,3/ = 0, V£. Moreover, the interactions between 
endmembers mi and m2 are defined by the map of coefficients 71^2,^ displayed in Fig. [TOl (top, left 
panel) where a black (resp. white) pixel represents the lowest (resp. highest) degree of non-linearity. 
As can be seen in this figure, 75% of the pixels (located in the bottom and upper right squares of the 
image) are mixed according to the LMM resulting in 71^2,^ = 0. The 25% remaining image pixels 
(located in the upper left square of the image) are mixed according to the GBM with nonlinearity 
coefficients 71,2/ radially increasing from to 1 (71,2/ = in the image center and 71,2/ = 1 in 
the upper left corner of the image). Note that this image contains a majority of pixels that are mixed 
linearly and belong to a common subspace of M^. Conversely, the non-linearly mixed pixels do not 
belong to this subspacefl We propose here to estimate the local subspace where a given image 
pixel and its nearest spectral neighbors vf^ live (vf^ denotes the set of the {K — l)-nearest 
neighbors of y^). 

Assuming as a first approximation that all the image pixels are linearly mixed, all these pixels 
are approximately contained in a common 2-dimensional subspace U that can be determined by 



'Assuming there is a majority of image pixels that are mixed linearly is a reasonable assumption for most hyperspectral 
images. 
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performing a PCA of yi,...,y^ (see ||29l for more details). The corresponding principal vectors 
spanning U are gathered in a matrix U. This matrix U is used as a priori knowledge regarding 
the 2-dimensional subspace containing |?/£,V^^^^ . However, this crude estimation can 

be refined by the Bayesian estimation strategy developed in the previous sections. More precisely, 
for each pixel y^, we compute the MMSD estimator of the N x p matrix Ui, whose columns 
are supposed to span the subspace Ui containing and its K — 1-nearest neig hbors ^\ The 
Bayesian estimator Ui is computed from its closed-form expression (fT4l ). i.e., using the Bingham 
prior where U has been introduced above. Then, for each pixel, we evaluate the distance between 
the two projection matrices C/^L/f and UU'^ onto the subspaces Ui = TZ {Ui \ and U = TZ {U), 



respectively. As stated in Section |lll the natural distance between these two projection matrices is 
given by (P (jJe,tJ^ = 2 — Tv ^U^LfO'^Ui^^. The resulting distance maps are depicted in 
Fig. [To] (bottom panels) for 2 non-zero values of r/ = 2a^K (as it can be noticed in (fT4l ). this 
hyperparameter r] balances the quantity of a priori knowledge U included in the estimation with 
respect to the information brought by the data). For comparison purpose, the subspace Uc has been 
also estimated by a crude SVD of ^yi,V^^ ^■'| (top right panel). In this case, tJi simply reduces 
to the associated principal singular vectors and can be considered as the MMSD estimator of 
obtained for = 0. 

These figures show that, for the 75% of the pixels generated using the LMM (bottom and right 
parts of the image), the subspace U estimated by an SVD of the whole dataset yi,. . . ,yj^ is very 
close to the hyperplanes locally estimated from jy^, V^^ | through the proposed approach (for 
any value of rj). Regarding the remaining 25% pixels resulting from the GBM (top left part of the 
image), the following comments can be made. When a crude SVD of V^^ is conducted, i.e., 
when no prior knowledge is taken into account to compute the MMSD (t] = 0, top right panel), the 
distance between the locally estimated subspace Ue and the a priori assumed hyperplane U does not 
reflect the non-linearities contained in the image. Conversely, when this crude SVD is regularized by 
incorporating prior knowledge with rj = 0.5 and i] = 50 (bottom left and right panels, respectively), 
leading to the MMSD estimator, the larger the degree of non-linearity, the larger the distance between 
IJ and Ui- To summarize, evaluating the distance between the MMSD estimator Ui and the a priori 
given matrix U allows the degree of non-linearity to be quantified. This interesting property is 
exploited on a real hyperspectral image in the following section. 

B. Real data 

The real hyperspectral image considered in this section has been acquired in 1997 over Moffett 
Field, CA, by the NASA spectro-imager AVIRIS. This image, depicted with composite true colors 
in Fig. [TT] (top, left panel), has been minutely studied in |[29l assuming a linear mixing model. The 
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scene consists of a large part of a lake (black pixels, top) and a coastal area (bottom) composed 
of soil (brown pixels) and vegetation (green pixels), leading to ii = 3 endmembers whose spectra 
and abundance maps can be found in C29il . A simple estimation of a lower-dimensional space U 
where the pixels live can be conducted through a direct SVD of the whole dataset, providing the 
a priori matrix tj. As in the previous section, this crude estimation can be refined by computing 
locally the MMSD estimators spanning the subspaces lAi (bottom panels). These estimators have 
been also computed with r? = 0, corresponding to an SVD of ^y^,vf^ (top, right figure). The 
distances between tj and have been reported in the maps of Fig. [TT] Again, for = (top, right 
panel), a simple local SVD is unable to locate possible non-linearities in the scene. However, for 
twcEl non-zero values r/ = 0.5 and = 50 (bottom left and right panels, respectively), the distances 
between the a priori recovered subspace U and the MMSD-based subspace hii clearly indicate that 
some non-linear effects occur in specific parts of the image, especially in the lake shore. Note that 
the non-linearities identified by the proposed algorithm are very similar to the ones highlighted in 
ll27l where the unmixing procedure was conducted by using the GBM defined in (l30b . This shows 
the accuracy of the proposed MMSD estimator to localize the non-linearities occurring in the scene, 
which is interesting for the analysis of hyperspectral images. 

VI. Conclusions 

This paper considered the problem of estimating a subspace using some available a priori informa- 
tion. Towards this end, a Bayesian framework was advocated, where the subspace U is assumed to be 
drawn from an appropriate prior distribution. However, since we operate in a Grassmann manifold, 
the conventional MMSE approach is questionable as it amounts to minimizing a distance which 
is not the most meaningful on the Grassmann manifold. Consequently, we revisited the MMSE 
approach and proposed, as an alternative, to minimize a natural distance on the Grassmann manifold. 
A general framework was formulated resulting in a novel estimator which entails computing the 

rr-i 

principal eigenvectors of the posterior mean of UU' . The theory was exemplified on a few simple 
examples, where the MMSD estimator can either be obtained in closed-form or requires resorting 
to an MCMC simulation method. The new approach enables one to combine efficiently the prior 
knowledge and the data information, resulting in a method that performs well at low SNR or with 
very small sample support. A successful application to the analysis of non-linearities contained in 
hyperspectral images was also presented. 



Additional results obtained with other values of iq are available online at 
|http://dobigeon.perso.enseeiht.fr/app_MMSD.html| 
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Appendix A 

The eigenvalue decomposition of / UU^pB{U)dU 

The purpose of this appendix is to prove the following proposition which can be invoked to obtain 
the MMSD estimator whenever the posterior distribution p{U\Y) is a Bingham distribution. 

Proposition 1 Let U E M^^^ be an orthogonal matrix -U^U = I- drawn from a Bingham 
distribution with parameter matrix A 

Pb{U) =ex.p{- kb{ A)} etT{U^AU} (31) 

with Kb (A) = In iFi Qp, ^N; A). Let A = Ua-^aU^ denote the eigenvalue decomposition of A 
where the eigenvalues are ordered in descending order. Let us define M = J UU'^ pB{U)dU . Then 
the eigenvalue decomposition of M writes 



M = exp{-KB(A)}[/,rj7: 



with r = ^ '^^Pi''^^-^)^ (lyid > 72 > • • • > In where 7„ = r(n, re). 

Proof: For notational convenience, let us work with the projection matrix P = UU^ whose 
distribution on the Grassmann manifold is |[T9l 



p{P) = exp {-kb(A)} etr {PA} 



(32) 



We have then that 



M = exp {-KB (A)} j Peii{PUaKaUl}dP 



j UlPUaetr{UlPUaK]dP 

j PeiT{PKa}dP 



= exp{-KB(A)}L'"a 

= exp {-KB (A)} (7a 
= exp{-KB(A)}t/amJ. 
Moreover V is diagonal since, for any orthogonal diagonal matrix D, 

rD = y" PDeiT{PKa}dP 

j D^PDetr{D^PDD^AaD}dP 



Ui 



D 



DV 
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where, to obtain the third Une, we made use of the fact that D^AaD = A^. It follows that the 
eigenvectors of M and A coincide, and that the eigenvalues of M are exp {— kb(A)} 7„, for n = 
1, • • • ,N. Moreover, it is known that exp {— kb(A)} = exp {— KB(Aa)} and, from (l32l ). one has 

exp{KB{K)} = J etr{PAa}dP. 
Differentiating the latter equation with respect to Xa{k) and denoting p„ = P{n,n), one obtains 



aexp{KB(Aa)} 

dXaik) 




= J Pketi{PAa}dP 

= Ik- 

The previous equation enables one to relate the eigenvalues of A and those of M. It remains to 
prove that 71 > 72 > • • • > ^n- Towards this end, we make use of a very general theorem due to 
Letac |[30]| . which is briefly outlined below. Let 

P{n,A){dX) = exp{K^(A)}etr {X^A} fi{dX) 

be a probability associated with a unitarily invariant measure /x on the set of N x N symmetric 
matrices. Consider the case of a diagonal matrix A = diag (ai, a2> • • • with ai > 02 > 

••• > dpf. Then |j30| proves that M = f XP{f^i, A){dX) is also diagonal, and moreover if 
M = diag (mi, 7712, • • • ,fnN) then mi > m2 > ■ ■ > rriN. Use of this theorem completes the 
proof of the proposition. ■ 

Remark 5 Most of the proposition could be proved, however in a rather indirect way, using the 
results in OTl . In this reference, Jupp and Mardia consider maximum likelihood estimation of 
the parameter matrix A from the observation of K independent matrices drawn from OT] ). 
Let P = Yl!k=i ^kU]^ and let its eigenvalue decomposition be P = VDV^ . Then the 
maximum likelihood estimate A of A has eigenvalue decomposition A = VDV^ with dn = 
9 exp {kb(-D)} /ddn- Moreover, due to Barndorff -Nielsen theorem for exponential families, one has 

tb{A) = J Pexp |Kij(A)} etr {PA^ dP = VDV^ 

which proves, since A = VDV , that 

j Pexp{Kij(D)}etr |pV£)F^}(iP = F£>F^. 

In |[3ll however, no results about the ordering of the eigenvalues was given. 
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Appendix B 

Sampling from the Bingham-von Mises Fisher distribution 

In this appendix, we show how to sample a unitary random matrix X € M^^^ from a (matrix) 
Bingham von Mises Fisher (BMF) distribution, X ~ BMF (A, B,C). As will be explained shortly, 
this amounts to sampling successively each column of X, and entails generating a random unit norm 
vector drawn from a (vector) BMF distribution. We briefly review how to sample the columns of X 
and then explain how to sample from a vector BMF distribution. 

A. The matrix BMF distribution 

The density of X ~ BMF (A, B, C) is given by 

p{X\A,B,C) eii {C^ X + BX^ AX] 

p 

oc exp {cj^Xk + B{k, k)xlAxk] (33) 

k=l 

where X = xi X2 ■■■ Xp and C = |^ci C2 ■ ■ ■ Cp ■ In 1221 a Gibbs-sampUng strategy 
was presented in order to sample from this distribution, in the case where A is full-rank. We consider 
here a situation where A is rank-deficient and therefore we need to bring appropriate modifications 
to the scheme of 1^ in order to handle the rank deficiency of A. As evidenced from (l33l) the 
distribution of A is a product of vector BMF distributions, except that the columns of X are not 
statistically independent since they are orthogonal with probability one. Let us rewrite X as X = 
xi ■■■ Xk-i Q^z Xk+i ■■■ Xp where z £ Sn~p+i = {x eR^~P^^^^;x'^x = l] and 
i& an N y. N — p + 1 orthonormal basis for TZ {X^k)'^ where stands for the matrix X with 
its fc-th column removed. As shown in [221 the conditional density of z given X_j^ is 

p {z\X^k) oc exp {cIQj_z + B{k, k)z'^Q\AQ^z] 

oc exp I c^z + 2;^ Az I (34) 

where c/c = Q\c]^ and A = B{k, k)Q\AQ ^. Therefore, z|X„jt follows a vector BMF distribution 
z|X_fc ~ vBMF ^A, Cfc^ . A Markov chain that converges to BMF (A, B, C) can thus be constructed 
as follows: 

Input: initial value X^*^^ 
I: for = 1, • • • ,p (random order) do 

2: compute a basis for the null space of X^k and set z = Q^Xk- 

3: compute = Q^c^ and A = B{k, k)Q'^AQ j_. 

4: sample z from a vBMF ^A, c^^ distribution (see next section). 

5: set Xk = Q±z. 

6: end for 
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B. The vector BMF distribution 

The core part of the above algorithm, see Une |4l is to draw a unit-norm random vector x distributed 
according to a vector Bingham-von Mises Fisher distribution. The latter distribution on the M- 
dimensional sphere has a density with respect to the uniform distribution given by 



p {x\c, A) oc exp \^c^x + x^Ax] , x G Sm- 



(35) 



In |[22l a Gibbs-sampling strategy was presented in order to sample from this distribution. While A 
was assumed to be full-rank in 123, we consider here a situation where A is rank-deficient, i.e. its 
eigenvalue decomposition can be written as A = EAE^ where E stands for the orthonormal matrix 
of the eigenvectors and A = diag (Ai, A2, • • • , A,., 0, • • • ,0) is the diagonal matrix of its eigenvalues. 
Our derivation follows along the same lines as in |[22l with the appropriate modifications due to the 
rank deficiency of A. Let y = E x & Sm and d = E c. Since yf^ = 1 — 2_/fc=i Uk^ uniform 
density in terms of the unconstrained coordinates {2/1,2/2, • • • ,yM-i} is proportional to |yA/j~^ and 
the density of {yi, 2/2, • • • , Vm-i} is given by 1221 



p{y\d,E) (xexp{cfy + y^Ay}\yM\ \ 



yli 



1 



M-l 
k=l 



M 



OC exp <^ ^ dkVk + ^ Afc^/fc > IvAil ^ (36) 
l.fc=i k=i ) 

In order to sample from this distribution, a Gibbs sampling strategy is advocated. Towards this end, 
we need to derive the conditional distributions of y^, given y_^ where y_^ stands for the vector y 
with its A;-th component removed. Similarly to 1221 . let us make the change of variables 6k = yf. 



and let q 



, so that {yf,yi,--- ,yli} = {ek,{l - Ok) q^k}- Since 

1/2 

this change of variables is not bijective, i.e. y^ ± 0^/ , we need to introduce the sign of y^, 

^ ,2 _ T Y-A/-l^,2 fl.U/2„l/2 



and we let s 



si S2 ■■■ sm\ ■ Note that yf, = 1 - Ztii vl bh-A = (1 - ^fc)'/'?;^ and 
qM = 1 — X]£=Ti'^fc 9^- ^s shown in f22l, the Jacobian of the transformation from {yi, y2, • • • , yA/-i} 



to {9, gi, • • • , qk-i,qk+i, ■■■ , qM~~i} is proportional to 9j, (1 - 9^) 
therefore the joint distribution of 9k, Sk, q^kj can be written as 



(M-2)/2 T-rM-l 
11^=1 



-1/2 



ij^k' 



, and 



P {9k, Sk, q^k, s-fc) oc 6*^. (1 - 6*, 



.(A/-3)/2 



.i^k 



'1/2 



X exp < 



Ske]!^dk + {l-9k)''^Y. 



1/2 



X < 



exp ^9k\k + (1 - 9k) YJi=i,i^k li^f] 1 <k <r 
exp{(l-0fc)ELi9fAa r + l<k<M 



(37) 



It follows that 
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for k € [1, r] 



X exp <j SkO]!'^dk + (1 - Ok) 



1/2 



g 



1/2 



T 



(38) 



for € [r + 1, Af] 



\{M-Z)/2 



exp{(l-0fe)q^A} 



1 /2 

X exp <! Sfc6';.' 4 + (1 - 



1/2 



1/2' 

s-k 



(39) 



1/2 

In the previous equations, stands for the element-wise vector or matrix product and is a short- 

r 



1/2 1/2 1/2 1/2 

9i ■■■ %+i ••• 1m 



hand notation to designate the vector 
P i&k: Sk\(l-k^ ^ -k) ^ we first sample 9k from 

p{9k\q-k,s^k) =p{dk,Sk = +p{9k,Sk = l\q_k,s_k) 

oc 9-'/' (1 - 1^^^^ ^ _ ^^)l/2| 

X exp I -40].''^ I + exp I -46*^/^ I 



. In order to sample from 



(40) 



where bk 



1/2 

S-k o q_k 



and 



(41) 



-q^X k e[r + l,M] 

Next, we sample Sk € {— 1,+1} with probabilities proportional to ^6"'^'=^'=^ ^e+'^''^'=' j. In order to 
sample from the distribution in (l40l ). an efficient rejection sampling scheme was proposed in ll22l . 
where the proposal distribution is a beta distribution with suitably chosen parameters. 
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Fig. 4. Fraction of energy of (7 in 7?, (U) versus K. N = 20, p — 5, n = 20 and SNR = 5dB. Linear model, Bingham 
prior. 




Fig. 5. Fraction of energy of C/ in 7?. {U) versus SNR. N = 20, p = 5, k = 20 and K — 5. Linear model, Bingham 
prior. 
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Fig. 8. Fraction of energy of J7 in 7^ (U) versus K. N = 20, p = 5, k = 20, SNR- = 5dB and SNR+ = lOdB. 
Covariance matrix model, Bingham prior. 




Fig. 9. Fraction of energy of [7 in 7^ {U) versus K. N = 20, p = 5, k = 20, SNR- = 5dB and SNR+ = lOdB. 
Covariance matrix model, vMF prior. 
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Fig. 10. Top, left: non-linearity coefficients 71,2. Top, right: distance between U and Un estimated with 77 = 0. Bottom: 
distance between TJ and TJ i estimated with 77 = 0.5 (left) and 77 = 50 (right). 







Fig. 11. Top, left: The Moffett Field scene as composite true colors. Top, right: distance between U and Un estimated 
with r; = 0. Bottom: distance between fj and TJ i estimated with 77 = 0.5 (left) and 77 = 50 (right). 



